Full-Depth MLP Megakernel + Fused Attention Preprocessing (non-record) #1316
AR6420 wants to merge 2 commits into openai:main
Conversation
Tile engine-inspired block-level Triton fusion for the 10min/16MB track:

- Full-depth MLP megakernel: 5 ops (RMSNorm → UpProj → LeakyReLU² → DownProj → Residual) fused into 1 Triton kernel. The 1536-dim intermediate is processed via tiled register accumulation and never materializes in HBM. Deeper than PR openai#1072.
- Fused attention preprocessing: QK RMSNorm + partial RoPE + q_gain in 2 Triton kernels (down from 6+). Novel: no other competition entry fuses post-projection ops.
- 41% memory reduction (1562 MiB vs 2656 MiB). Numerically exact (cos_sim > 0.99998).
- Based on PR openai#1019 (abaybektursun). H100 results PENDING.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
H100 results (8xH100 SXM, 600s):

- Sliding BPB: 1.1310 (seed 1337)
- Steps: 4,917 at 122.0 ms/step
- Artifact: 15.6 MB

Key finding: the megakernel is 41% slower on H100 (122 ms vs SOTA's 86.7 ms). cuBLAS tensor cores outperform tiled tl.dot even with H100's larger SRAM. The 41% memory reduction is confirmed (15.7 GiB / 80 GiB). Single-seed submission: seeds 42/2025 blocked by compute budget.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Community Review — Full-Depth MLP Megakernel + Fused Attention Preprocessing (non-record)

BPB: 1.1310 | Compliance: LOOKS CLEAN — pure-neural submission, no TTT/SLOT/n-gram-cache

What I found in the code (at the head SHA): static code review found no TTT adaptation function, no SLOT optimization loop, no n-gram-cache class, and no pre-quant val-token fine-tune. The eval path uses the standard sliding-window stride-64 pattern. The submission is a pure-neural architecture iteration on the standard SP1024/SP4096/SP8192 baseline.

CPU smoke test (CT2038 proteus-engine, 2026-04-11): import OK in 0.48s, dim=512, layers=11, vocab=1024, code=101703 B, SMOKE_TEST_PASS

Verdict: LOOKS CLEAN. Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: MERGE pending the usual record-track checks (3-seed validation, under-16MB artifact cap, ≤600s train + ≤600s eval on 8×H100 SXM). No compliance flags from the classification pass.

Auto-classification caveat: this review was drafted by the deterministic AST-based classifier. If there is a non-standard eval mechanism (logit postprocessing, hedge mixing, etc.) that I missed because it is factored into a helper file or hidden behind a non-standard function name, please flag it and I will re-run the audit manually.

Reviewed by @MatoTeziTanka — The Agora.
Full-Depth MLP Megakernel + Fused Attention Preprocessing
val_bpb: 1.1310 (1-seed, SEED=1337) | 15.6 MB | 8xH100 SXM, 600s | Track: non_record_16mb
The Idea: What if a Video Rendering Engine Architecture Could Train Transformers Faster?
This submission started with a question from a different domain entirely.
While designing a tile-based GPU rendering engine for real-time video rendering -- where 4K frames are split into tiles that fit in L2 cache and multiple operations (color correction, blur, sharpen) are fused within each tile to avoid VRAM bandwidth bottlenecks -- I realized the same memory hierarchy problem exists in transformer training: intermediate activations are written to HBM between every operation, even when the next operation immediately reads them back.
The rendering engine's solution: keep data in fast on-chip L2 cache, apply all operations there, write once. The transformer equivalent: keep the 1536-dim MLP intermediate in GPU registers, process it via tiled accumulation through the gate projection -> activation -> down projection chain, and never let it touch HBM.
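As a minimal sketch of that dataflow (NumPy standing in for Triton; the tile size, weight names, and the exact LeakyReLU² form are illustrative assumptions, not the submission's code), the hidden-dim intermediate is produced and consumed one tile at a time, so the full 1536-wide vector never exists at once:

```python
import numpy as np

def fused_mlp_tiled(x, w_up, w_down, eps=1e-6, tile=256):
    """Tiled-accumulation sketch of the megakernel's dataflow: each tile of
    the intermediate lives only long enough to be activated and folded into
    the down-projection accumulator (the stand-in for register accumulation)."""
    h = x / np.sqrt(np.mean(x * x) + eps)        # RMSNorm
    d_hidden = w_up.shape[1]
    out = np.zeros_like(x)
    for start in range(0, d_hidden, tile):
        sl = slice(start, start + tile)
        t = h @ w_up[:, sl]                      # UpProj: one tile of the intermediate
        a = np.where(t > 0, t, 0.01 * t)         # LeakyReLU (slope assumed)
        t = a * np.abs(a)                        # squared, sign-preserving (assumed form)
        out += t @ w_down[sl, :]                 # DownProj, accumulated tile by tile
    return x + out                               # Residual
```

Because the down projection is linear in the intermediate, the tiled accumulation is exactly equal to the untiled computation; only the memory footprint changes.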
This cross-domain transfer produced two novel contributions, an honest failure, and a key insight about GPU computing.
H100 Results (8xH100 SXM, 600s)
Seeds 42 and 2025 were blocked by compute budget exhaustion. Awaiting grant approval for additional validation runs.
Comparison vs SOTA (PR #1019, 1.1147 BPB): +0.016 BPB, caused by 41% slower step time (122ms vs 86.7ms) resulting in 2,005 fewer training steps.
What Worked
What Didn't Work
tl.dot calls cannot compete with cuBLAS's single large GEMM, which has decades of per-architecture tensor core optimization. This is worse than the 15% slowdown on consumer GPUs -- H100's stronger cuBLAS widens the gap.

The Key Insight
The Tile Engine metaphor works perfectly for element-wise operations but not for matmul-dominated workloads. In video processing (all per-pixel ops), tiling into SRAM is optimal -- there are no matrix multiplications to compete with cuBLAS. In transformers (90% matmul by compute), the matmuls should be delegated to hardware-optimized libraries while tiling handles only the element-wise glue between them. The right strategy isn't to replace cuBLAS -- it's to partner with it.
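A sketch of what that partnership looks like in practice (np.matmul standing in for cuBLAS; the activation's slope and squared form are assumptions): the two GEMMs stay with the vendor library, and only the element-wise stages between them are fusion candidates.

```python
import numpy as np

def mlp_partnered(x, w_up, w_down, eps=1e-6):
    """Delegate matmuls to the optimized GEMM (np.matmul here, cuBLAS on GPU)
    and reserve kernel fusion for the element-wise glue around them."""
    h = x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)  # fuse: RMSNorm
    t = h @ w_up                        # delegate: hardware GEMM
    a = np.where(t > 0, t, 0.01 * t)    # fuse: LeakyReLU (slope assumed)
    t = a * np.abs(a)                   # fuse: sign-preserving square (assumed form)
    return x + t @ w_down               # delegate: hardware GEMM, residual as epilogue
```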
Local Benchmarks (RTX 5070 Ti, 1 GPU, 500 steps)
Architecture
Based on PR #1019 (@abaybektursun). Same 11L/512d/8H/4KV architecture with all existing components (XSA-all, BigramHash 3072x112, EMA+SWA, Self-Gen GPTQ, Parallel Muon). Added: Triton MLP megakernel + fused attention preprocessing.
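The fused attention preprocessing can be sketched per head as follows (NumPy reference; the function and parameter names, the rotate-half RoPE convention, and q_gain being a scalar are assumptions about the submission's conventions):

```python
import numpy as np

def fused_qk_preprocess(q, k, cos, sin, q_gain, rot_dim, eps=1e-6):
    """One-pass sketch of QK RMSNorm + partial RoPE + q_gain: each head vector
    is normalized, its first rot_dim dims rotated, and q scaled, without
    round-tripping intermediates through memory between the three steps."""
    def rms(v):
        return v / np.sqrt(np.mean(v * v, axis=-1, keepdims=True) + eps)

    def partial_rope(v):
        r, rest = v[..., :rot_dim], v[..., rot_dim:]
        r1, r2 = np.split(r, 2, axis=-1)
        rot = np.concatenate([r1 * cos - r2 * sin, r1 * sin + r2 * cos], axis=-1)
        return np.concatenate([rot, rest], axis=-1)

    q = partial_rope(rms(q)) * q_gain   # QK-norm -> partial RoPE -> gain
    k = partial_rope(rms(k))            # keys get norm + RoPE, no gain
    return q, k
```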
Credits
PR #1019 @abaybektursun (base), PR #1072 (Triton fusion baseline), PR #1105 (backward epilogue), PR #399 @abaybektursun (Parallel Muon), PR #493 @parinzee (LeakyReLU^2), PR #478 @gowtham0992 (XSA), PR #315 @jfprincz (Partial RoPE), PR #374 @unnir (VE), PR #162 @raahilshah (BigramHash), PR #535 @raahilshah (GPTQ)
See README.md for full technical details, learnings, and future directions.